renv::install("bioc::dada2")
renv::install("targets")
renv::install("tarchetypes")
renv::install("quarto")
renv::init()
renv::snapshot()

Sequencing data processing for the SynCom co-culture experiment
Objective
The data were obtained by sequencing the V4 region of the 16S rRNA gene, performed by Novogene on an Illumina (Novaseq?) PE250 platform. Sequences will be demultiplexed and then processed using dada2 within a {targets} workflow.
Setting up {renv}, {targets} and packages
I install the packages I need using {renv}, then initialise renv to create the renv.lock file, which records the package versions.
I also create the _targets.R file which will contain the analysis workflow:
targets::use_targets()

Then we can visualise the workflow network:
library(targets)
tar_visnetwork()

Read demultiplexing
The run contains multiplexed samples and only one primer pair. The full sequences of primers + spacer + tag are in data/tag_primers_16SV4_fwd.fasta and data/tag_primers_16SV4_rev.fasta.
The raw data were downloaded from Novogene on BlueBEAR at ~/cambonm-oak-syncoms/250730_SynCom_stability_experiment/01_raw_data
The MD5 checksum of the archive is verified and the archive is then extracted:
md5sum -c MD5.txt
> X204SC24085297-Z01-F002.tar: OK
tar -xvf X204SC24085297-Z01-F002.tar
cd X204SC24085297-Z01-F002/01.RawData/pool_1/
md5sum -c MD5.txt
> pool_1_FKDN250298699-1A_22NW5FLT4_L5_1.fq.gz: OK
> pool_1_FKDN250298699-1A_22NW5FLT4_L5_2.fq.gz: OK

Because I sent a pool of amplicons to Novogene, which were then ligated to Illumina adapters, fragments were inserted in random orientation and need to be demultiplexed in both orientations, matching both primer and tag!
In later versions of cutadapt, the --revcomp option automatically looks for primers in both orientations.
I tested this option against my old loop-based version and got more or less the same unknown-unknown file sizes, so I'll just go ahead with the --revcomp option from now on.
Running cutadapt
I ran Cutadapt v4.9 on BlueBEAR:
#!/bin/bash
#SBATCH -o cutadapt.revcomp.demultiplex.o.%J # Job output file
#SBATCH -e cutadapt.revcomp.demultiplex.e.%J # Job error file
#SBATCH --ntasks=4 # number of parallel processes (tasks)
#SBATCH --qos=bbdefault # selected queue
#SBATCH --time=72:00:00 # time limit
set -e
module purge
module load bear-apps/2023a
module load cutadapt/4.9-GCCcore-12.3.0
## 16SV4
export input_dir=/rds/projects/c/cambonm-oak-syncoms/250730_SynCom_stability_experiment/01_raw_data/PE250/X204SC24085297-Z01-F003/01.RawData/pool_1/
export output_dir=16SV4_revcomp
mkdir $output_dir
# --cores=0 auto detects the number of available cores
# --revcomp searches for primers and adapters in both orientations
cutadapt \
-e 0.15 --no-indels \
--cores=0 \
--revcomp \
-g file:tag_primers_16SV4_fwd.fasta \
-G file:tag_primers_16SV4_rev.fasta \
-o $output_dir/{name1}-{name2}.R1.fastq.gz -p $output_dir/{name1}-{name2}.R2.fastq.gz \
$input_dir/pool_1_1.fq.gz $input_dir/pool_1_2.fq.gz

Read number per sample after demultiplexing
I counted the number of reads per sample after demultiplexing with the following code:
#!/bin/bash
#SBATCH -o count.demultiplex.o.%J # Job output file
#SBATCH -e count.demultiplex.e.%J # Job error file
#SBATCH --ntasks=1 # number of parallel processes (tasks)
#SBATCH --qos=bbdefault # selected queue
#SBATCH --time=00:05:00 # time limit
set -e
module purge
module load bear-apps/2023a/live
touch count_demultiplexed.txt
for f in 16SV4_revcomp/*R1.fastq.gz
do
# count reads by dividing the number of FASTQ lines by 4
# (grepping for '@' would overcount: quality lines can also contain '@')
c=$(( $(zcat "$f" | wc -l) / 4 ))
echo $f $c >> count_demultiplexed.txt
done

Then plot the results:
tar_load(fig_read_counts_demultiplex)
fig_read_counts_demultiplex

Every single sample has more than 1000 reads, including the negative controls and unused tags! Because there were relatively few samples in the pool, the sequencing depth is very high. I will need to be quite stringent in the cleaning steps.
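As a cross-check, the same per-file counts can be computed in R with the ShortRead package (a sketch, assuming the demultiplexed files are accessible locally; ShortRead is not otherwise part of this workflow):

```r
library(ShortRead)

# countFastq() parses FASTQ records properly, so quality lines that
# happen to start with '@' are not miscounted
counts <- countFastq("16SV4_revcomp", pattern = "R1\\.fastq\\.gz$")
head(counts[, "records", drop = FALSE])
```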
Up to this point I ran the scripts manually and not within the targets workflow, because I am not too sure how to deal with running bash scripts on SLURM within targets, but I'll try and improve this in the future.
The {targets} workflow starts below ⬇️
Read quality filtering
First, the demultiplexed reads are filtered based on their quality. Sequences with unknown bases (Ns) or more than 2 expected errors are removed.
First we get the paths of the demultiplexed files:
function(path, metadata) {
# sort both file lists so forward and reverse reads stay paired
fnFs <- sort(list.files(path, pattern=".R1.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern=".R2.fastq", full.names = TRUE))
sample.names <- metadata$sample_ID[match(basename(fnFs), basename(metadata$file))]
fnFs <- fnFs[!is.na(sample.names)]
fnRs <- fnRs[!is.na(sample.names)]
sample.names <- sample.names[!is.na(sample.names)]
return(list("fnFs" = fnFs, "fnRs" = fnRs, "sample_names"=sample.names))
}
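For reference, a sketch of how these helpers might be wired together in _targets.R; the target and function names (get_demux_paths(), filter_reads(), sample_metadata) are illustrative, not the actual ones in my workflow:

```r
library(targets)

# Illustrative _targets.R fragment; all names are hypothetical
list(
  tar_target(demux_paths,
             get_demux_paths("16SV4_revcomp", sample_metadata)),
  tar_target(filt_paths,
             filter_reads(demux_paths, "output"))
)
```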
Then we filter the sequences and write them to the output folder:
function(fns_paths, outp) {
filtFs <- file.path(outp, "01_filtered", paste0(fns_paths[["sample_names"]], "_F_filt.fastq.gz"))
filtRs <- file.path(outp, "01_filtered", paste0(fns_paths[["sample_names"]], "_R_filt.fastq.gz"))
names(filtFs) <- fns_paths[["sample_names"]]
names(filtRs) <- fns_paths[["sample_names"]]
count_denoise <- filterAndTrim(fns_paths$fnFs, filtFs, fns_paths$fnRs, filtRs, #truncLen=c(240,160),
maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=FALSE) # multithreading does not work on RStudio Server; I am not sure why
return(list("filtFs"=filtFs, "filtRs"=filtRs, "count_denoise"=count_denoise))
}
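Before settling on filtering parameters, it can help to inspect the per-cycle quality profiles of a few files with dada2's plotQualityProfile() (a quick interactive check, not a target in the workflow):

```r
library(dada2)

# Inspect per-cycle quality for the first two forward and reverse files;
# useful for deciding whether a truncLen is needed
plotQualityProfile(fns_paths$fnFs[1:2])
plotQualityProfile(fns_paths$fnRs[1:2])
```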
DADA2 denoising
Information from the dada2 tutorial:
The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).
function(filt_paths) {
filtFs <- filt_paths[["filtFs"]]
filtRs <- filt_paths[["filtRs"]]
errF <- learnErrors(filtFs, multithread=FALSE)
errR <- learnErrors(filtRs, multithread=FALSE)
dadaFs <- dada(filtFs, err=errF, multithread=TRUE)
dadaRs <- dada(filtRs, err=errR, multithread=TRUE)
return(list("dadaFs"=dadaFs, "dadaRs"=dadaRs))
}
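A common sanity check on the fitted error model is to plot the estimated error rates against the observed ones (again an interactive check rather than a target; errF is the object returned by learnErrors() above):

```r
library(dada2)

# Points: observed error rates; black line: fitted model;
# red line: expected rates under the nominal Q-score definition
plotErrors(errF, nominalQ = TRUE)
```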
Paired-end merging
Forward and reverse reads are now merged:
mergePairs(dadas[["dadaFs"]], filt_paths[["filtFs"]],
dadas[["dadaRs"]], filt_paths[["filtRs"]],
verbose=TRUE)

Checking and filtering sequence length
Here we check the length of the sequences after merging:
tar_load(fig_seqlength)
fig_seqlength

Wow, all ASVs are pretty much the same size; I don't think I need to filter further on length! I'll do it anyway, just to build the target for future use.
tar_load(seqtab_filt)
seuil <- nchar(rownames(seqtab_filt))

After filtering, all sequences are between 230 and 270 bp and the dataset has 3915 ASVs.
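For the record, a minimal sketch of what the length filter can look like, assuming the sequence table stores sequences as column names (as returned by makeSequenceTable()); the 230 and 270 bp bounds are the ones mentioned above:

```r
# Keep only ASVs whose merged sequence length falls in the expected range
seq_lengths <- nchar(colnames(seqtab))
seqtab_filt <- seqtab[, seq_lengths >= 230 & seq_lengths <= 270]
```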
Removing chimeras
rm(seqtab_filt)
tar_load(seqtab_filt_nochim)

After removing chimeras, the dataset has 1119 ASVs.
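Chimera removal follows the standard dada2 approach; a sketch of the underlying call (object names are illustrative):

```r
library(dada2)

# Remove bimeras by comparing each sequence against more abundant "parents"
seqtab_nochim <- removeBimeraDenovo(seqtab_filt,
                                    method = "consensus",
                                    multithread = FALSE,
                                    verbose = TRUE)
```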
Removing low-abundance ASVs
I removed ASVs with fewer than 1000 reads in the whole dataset: because these are in vitro samples, I expect all ASVs to be highly abundant, at least at the first time point of the experiment. I might need to go back and change that later.
rm(seqtab_filt_nochim)
tar_load(seqtab_filt_nochim_abund)

After removing low-abundance ASVs, the dataset has 44 ASVs.
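The abundance filter itself is a one-liner; a sketch, assuming samples are rows and ASVs are columns:

```r
# Drop ASVs with fewer than 1000 reads summed across all samples
seqtab_abund <- seqtab_nochim[, colSums(seqtab_nochim) >= 1000]
```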
Read loss
We compute the number of reads at each step of the pipeline:
tar_load(fig_read_loss)
fig_read_loss

Assigning taxonomy with my in silico reference database
I first assigned taxonomy at 100% identity with the assignSpecies() function from dada2, but it seems I have some sequence errors that were not detected by the denoising algorithm, so I am assigning taxonomy again with BLAST+ at 98% identity. To perform this I am using rBLAST.
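A sketch of the rBLAST side of this, assuming a BLAST+ nucleotide database has already been built from the reference sequences (the database path and object names are illustrative):

```r
library(rBLAST)

# Open an existing BLAST+ nucleotide database and query the ASV sequences
bl <- blast(db = "data/insilico_reference_db", type = "blastn")
hits <- predict(bl, asv_sequences)  # asv_sequences: a DNAStringSet of ASVs

# Keep hits at or above 98% identity
hits_98 <- hits[hits$pident >= 98, ]
```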
tar_load(asv_metadata_taxo)
head(asv_metadata_taxo)

  ASV_ID
1 ASV_1
2 ASV_2
3 ASV_3
4 ASV_4
5 ASV_5
6 ASV_6
sequence
1 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAAGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG
2 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGCGCTTAACGTGGGAACTGCATTTGAAACTGGCAAGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG
3 TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGAGCTTAACTTGGGAACTGCATTTGAAACTGGCAAGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG
4 TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTGTTAAGTTGGATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTGGCAAGCTAGAGTACGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
5 TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGCAGGTGGTTTCTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACACTGAGGCGCGAAAGCGTGGGGAGCAAACAGG
6 TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTGTTAAGTTGGATGTGAAAGCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTGGCAAGCTAGAGTACAGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
abundance prevalence sseqid pident length mismatch gapopen qstart qend
1 4862885 192 FOH354 100.000 253 0 0 1 253
2 2993893 192 FOS446B 100.000 253 0 0 1 253
3 2137064 192 FOS97B 100.000 253 0 0 1 253
4 1368243 190 FOS972 98.814 253 3 0 1 253
5 908920 192 FOH358 100.000 253 0 0 1 253
6 253107 159 FOS972 98.419 253 4 0 1 253
sstart send evalue bitscore multiple_matches
1 273 21 1.41e-135 468 FALSE
2 20 272 1.41e-135 468 TRUE
3 273 21 1.41e-135 468 TRUE
4 273 21 1.41e-130 451 TRUE
5 273 21 1.41e-135 468 FALSE
6 273 21 6.58e-129 446 TRUE
Session info
sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.5.0 (2025-04-11)
os Debian GNU/Linux 13 (trixie)
system x86_64, linux-gnu
ui X11
language en_GB:en
collate en_GB.UTF-8
ctype en_GB.UTF-8
tz Europe/London
date 2026-02-05
pandoc 3.6.3 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
quarto 1.8.25 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/quarto
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P backports 1.5.0 2024-05-23 [?] CRAN (R 4.5.0)
P base64url 1.4 2018-05-14 [?] RSPM (R 4.5.0)
P BiocManager 1.30.26 2025-06-05 [?] CRAN (R 4.5.0)
P callr 3.7.6 2024-03-25 [?] RSPM (R 4.5.0)
P cli 3.6.5 2025-04-23 [?] CRAN (R 4.5.0)
P codetools 0.2-20 2024-03-31 [?] CRAN (R 4.3.3)
P data.table 1.17.4 2025-05-26 [?] RSPM (R 4.5.0)
P digest 0.6.37 2024-08-19 [?] CRAN (R 4.5.0)
P dplyr 1.1.4 2023-11-17 [?] CRAN (R 4.5.0)
P evaluate 1.0.3 2025-01-10 [?] CRAN (R 4.5.0)
P farver 2.1.2 2024-05-13 [?] CRAN (R 4.5.0)
P fastmap 1.2.0 2024-05-15 [?] CRAN (R 4.5.0)
P generics 0.1.4 2025-05-09 [?] CRAN (R 4.5.0)
P ggplot2 3.5.2 2025-04-09 [?] CRAN (R 4.5.0)
P glue 1.8.0 2024-09-30 [?] CRAN (R 4.5.0)
P gtable 0.3.6 2024-10-25 [?] CRAN (R 4.5.0)
P htmltools 0.5.8.1 2024-04-04 [?] CRAN (R 4.5.0)
htmlwidgets 1.6.4 2023-12-06 [1] RSPM (R 4.5.0)
P igraph 2.1.4 2025-01-23 [?] CRAN (R 4.5.0)
P jsonlite 2.0.0 2025-03-27 [?] CRAN (R 4.5.0)
P knitr 1.50 2025-03-16 [?] CRAN (R 4.5.0)
P labeling 0.4.3 2023-08-29 [?] CRAN (R 4.5.0)
P lifecycle 1.0.4 2023-11-07 [?] CRAN (R 4.5.0)
P magrittr 2.0.3 2022-03-30 [?] CRAN (R 4.5.0)
P pillar 1.10.2 2025-04-05 [?] CRAN (R 4.5.0)
P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.5.0)
P prettyunits 1.2.0 2023-09-24 [?] RSPM (R 4.5.0)
P processx 3.8.6 2025-02-21 [?] RSPM (R 4.5.0)
P ps 1.9.1 2025-04-12 [?] RSPM (R 4.5.0)
P R6 2.6.1 2025-02-15 [?] CRAN (R 4.5.0)
P RColorBrewer 1.1-3 2022-04-03 [?] CRAN (R 4.5.0)
renv 1.1.5 2025-07-24 [1] RSPM (R 4.5.0)
P rlang 1.1.6 2025-04-11 [?] CRAN (R 4.5.0)
P rmarkdown 2.29 2024-11-04 [?] CRAN (R 4.5.0)
P rstudioapi 0.17.1 2024-10-22 [?] CRAN (R 4.5.0)
P scales 1.4.0 2025-04-24 [?] CRAN (R 4.5.0)
P secretbase 1.0.5 2025-03-04 [?] RSPM (R 4.5.0)
P sessioninfo 1.2.3 2025-02-05 [?] CRAN (R 4.5.0)
P targets * 1.11.3 2025-05-08 [?] RSPM (R 4.5.0)
P tibble 3.2.1 2023-03-20 [?] RSPM (R 4.5.0)
P tidyselect 1.2.1 2024-03-11 [?] CRAN (R 4.5.0)
P vctrs 0.6.5 2023-12-01 [?] CRAN (R 4.5.0)
visNetwork 2.1.4 2025-09-04 [1] RSPM (R 4.5.0)
P withr 3.0.2 2024-10-28 [?] CRAN (R 4.5.0)
P xfun 0.52 2025-04-02 [?] CRAN (R 4.5.0)
P yaml 2.3.10 2024-07-26 [?] CRAN (R 4.5.0)
[1] /home/marine/Documents/1_Postdoc_Birmingham/1_Experiments/250730_SynCom_stability_experiment_data_processing/renv/library/linux-debian-trixie/R-4.5/x86_64-pc-linux-gnu
[2] /home/marine/.cache/R/renv/sandbox/linux-debian-trixie/R-4.5/x86_64-pc-linux-gnu/9a444a72
* ── Packages attached to the search path.
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────
Update the renv snapshot:
renv::snapshot()

- Lockfile written to "~/Documents/1_Postdoc_Birmingham/1_Experiments/250730_SynCom_stability_experiment_data_processing/renv.lock".